The prominent inequality of wealth and income is a huge concern especially in the United States. The likelihood of diminishing poverty is one valid reason to reduce the world’s surging level of economic inequality. The principle of universal moral equality ensures sustainable development and improves the economic stability of a nation. Governments in different countries have been trying their best to address this problem and provide an optimal solution. The aim here to show the usage of machine learning techniques in providing a solution to the income equality problem. The UCI Adult Dataset has been used for the purpose. Specifically, several machine learning classification models have been compared to predict whether a person’s yearly income in the US falls in the income category of either greater than 50K dollars or less/equal to 50K dollars category based on a certain set of attributes. So, what_Y(>50, <=50) is predicted given (X1, X2, X3,…Xn), where Y is an income level, and X is a statistic feature of an individual.
library(knitr)
library(ggvis)
library(ISLR)
library(e1071)
library(gmodels)
library(tidyverse)
library(tidyr)
library(dplyr)
library(readr)
library(ggplot2)
library(randomForest)
library(caret)
library(data.table)
library(gbm)
library(rpart)
library(rpart.plot)
library(plotly)
library(ggvis)
library(neuralnet)
library(nnet)
library(MASS)
library(devtools)
install_github('araastat/reprtree', force = TRUE)
##
##
checking for file ‘/private/var/folders/xc/v9gh8m7j65n8yj0gq2_m5v8r0000gn/T/Rtmp5w1Ym4/remotes3ea940b2b/araastat-reprtree-5cff507/DESCRIPTION’ ...
✔ checking for file ‘/private/var/folders/xc/v9gh8m7j65n8yj0gq2_m5v8r0000gn/T/Rtmp5w1Ym4/remotes3ea940b2b/araastat-reprtree-5cff507/DESCRIPTION’ (384ms)
##
─ preparing ‘reprtree’:
##
checking DESCRIPTION meta-information ...
✔ checking DESCRIPTION meta-information
##
─ checking for LF line-endings in source and make files and shell scripts
##
─ checking for empty or unneeded directories
##
─ building ‘reprtree_0.6.tar.gz’
##
##
library(reprtree)
library(rattle)
library(NeuralNetTools)
url.train <- "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
url.test <- "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test"
url.names <- "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.names"
download.file(url.train, destfile = "adult_train.csv")
download.file(url.test, destfile = "adult_test.csv")
download.file(url.names, destfile = "adult_names.txt")
# Read the training and test data into memory
train <- read.csv("adult_train.csv", header = FALSE)
# The test data has an unnecessary first line that messes stuff up, this fixes that problem
all_content <- readLines("adult_test.csv")
skip_first <- all_content[-1]
test <- read.csv(textConnection(skip_first), header = FALSE)
feature <- c("Age","Work_Class", "Final_Weight", "Education", "Education_Num", "Marital_Status", "Occupation", "Relationship", "Race", "Sex", "Capital_Gain", "Capital_Loss", "Hours_Per_Week", "Native_Country", "IncomeCLASS")
##
## TRUE
## 32561
##
## Under_50K More_50K
## 76 24
## [1] 0
## [1] 0
set.seed(1414)
model.lda <- train(IncomeCLASS ~ .,
data = train,
method = "lda")
plot(varImp(model.lda))
P <- ggplot(train,aes(x = Age, fill = IncomeCLASS)) + geom_bar(position = "fill", color = "black") + coord_flip() + theme(axis.text = element_text(colour = "darkred", size = rel(0.75), angle = 45))
P1 <- P + labs(title = "Age vs Income Class Proportion")
ggplotly(P1 = ggplot2::last_plot())
gg <- train %>% ggvis(x = ~Age) %>% layer_histograms(fill = "green")
gg
Q <- ggplot(train,aes(x = Education, fill = IncomeCLASS)) + geom_bar() + coord_flip()
Q1 <- Q + labs(title = "Education vs Income Class Proportion") + theme(axis.text = element_text(colour = "darkred", size = rel(0.75), angle = 45))
ggplotly(Q1 = ggplot2::last_plot())
train %>% ggvis(x = ~Age) %>% layer_histograms() %>% layer_paths( strokeWidth := 1, stroke := "red")
R <- train %>% ggplot(aes(x = Work_Class, fill = IncomeCLASS)) + geom_bar(position = "fill", color = "black") + theme(axis.text = element_text(colour = "darkred", size = rel(0.75), angle = 45))
R1 <- R + labs(title = "Work_Class vs Income Class Proportion")
ggplotly(R1 = ggplot2::last_plot())
We find that the people employed in private companies have more people with income above 50k and Self Employed people having a higher proportion of peoplw with income greater than 50k.
hh <- train %>% ggvis(x = ~Education_Num) %>% layer_histograms(fill = "green")
hh
S <- ggplot(train,aes(x = Education_Num, fill = IncomeCLASS)) + xlab("Years of Education") + ylab("proportion within category") + geom_bar(color = "black") + coord_flip() + theme(axis.text = element_text(colour = "darkred", size = rel(0.75), angle = 45))
S1 <- S + labs(title = "Length of Eduction VS Income Class Proportion")
ggplotly(S1 = ggplot2::last_plot())
T <- ggplot(train,aes(x = Relationship, fill = IncomeCLASS)) + ggtitle("Relationship VS Income Class Proportion") + xlab("Relationship") + ylab("proportion within category") + geom_bar(position = "fill", color = "black") + coord_flip() + theme(axis.text = element_text(colour = "darkred", size = rel(0.75), angle = 45))
T1 <- T + labs(title = "Relationship vs Income Class Proportion")
ggplotly(T1 = ggplot2::last_plot())
jj <- ggplot(train,aes(x = Occupation, fill = IncomeCLASS)) + ggtitle("Occupation VS Income Class Proportion") + xlab("Occupation") + ylab("proportion within category") + geom_bar(position = "fill", color = "black") + coord_flip() + theme(axis.text = element_text(colour = "darkred", size = rel(0.75), angle = 45))
ggplotly(jj = ggplot2::last_plot())
#train Naive Bayes
set.seed(32323)
model_Naive <- naiveBayes(IncomeCLASS ~ ., data = train)
pred_Nb <- predict(model_Naive, test)
levels(train$IncomeCLASS) <- c("Under_50K", "More_50K")
levels(test$IncomeCLASS) <- c("Under_50K", "More_50K")
levels(pred_Nb) <- c("Under_50K", "More_50K")
confusionMatrix(pred_Nb, test$IncomeCLASS)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Under_50K More_50K
## Under_50K 11567 1978
## More_50K 868 1868
##
## Accuracy : 0.8252
## 95% CI : (0.8193, 0.831)
## No Information Rate : 0.7638
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4619
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9302
## Specificity : 0.4857
## Pos Pred Value : 0.8540
## Neg Pred Value : 0.6827
## Prevalence : 0.7638
## Detection Rate : 0.7105
## Detection Prevalence : 0.8320
## Balanced Accuracy : 0.7079
##
## 'Positive' Class : Under_50K
##
CrossTable(pred_Nb, test$IncomeCLASS)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 16281
##
##
## | test$IncomeCLASS
## pred_Nb | Under_50K | More_50K | Row Total |
## -------------|-----------|-----------|-----------|
## Under_50K | 11567 | 1978 | 13545 |
## | 144.270 | 466.457 | |
## | 0.854 | 0.146 | 0.832 |
## | 0.930 | 0.514 | |
## | 0.710 | 0.121 | |
## -------------|-----------|-----------|-----------|
## More_50K | 868 | 1868 | 2736 |
## | 714.229 | 2309.267 | |
## | 0.317 | 0.683 | 0.168 |
## | 0.070 | 0.486 | |
## | 0.053 | 0.115 | |
## -------------|-----------|-----------|-----------|
## Column Total | 12435 | 3846 | 16281 |
## | 0.764 | 0.236 | |
## -------------|-----------|-----------|-----------|
##
##
summary(pred_Nb)
## Under_50K More_50K
## 13545 2736
cm_Nb <- data.frame(confusionMatrix(pred_Nb, test$IncomeCLASS)[3])
kable(cm_Nb)
| overall | |
|---|---|
| Accuracy | 0.8251950 |
| Kappa | 0.4619397 |
| AccuracyLower | 0.8192724 |
| AccuracyUpper | 0.8310012 |
| AccuracyNull | 0.7637737 |
| AccuracyPValue | 0.0000000 |
| McnemarPValue | 0.0000000 |
# rpart decision tree
set.seed(32323)
model_part <- caret::train(IncomeCLASS ~ ., data = train, method = "rpart")
pred_rpart <- predict(model_part, test, type = "raw")
confusionMatrix(pred_rpart, test$IncomeCLASS)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Under_50K More_50K
## Under_50K 11808 2042
## More_50K 627 1804
##
## Accuracy : 0.8361
## 95% CI : (0.8303, 0.8417)
## No Information Rate : 0.7638
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4796
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9496
## Specificity : 0.4691
## Pos Pred Value : 0.8526
## Neg Pred Value : 0.7421
## Prevalence : 0.7638
## Detection Rate : 0.7253
## Detection Prevalence : 0.8507
## Balanced Accuracy : 0.7093
##
## 'Positive' Class : Under_50K
##
cm_rpart <- data.frame(confusionMatrix(pred_rpart, test$IncomeCLASS)[3])
CrossTable(pred_rpart, test$IncomeCLASS)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 16281
##
##
## | test$IncomeCLASS
## pred_rpart | Under_50K | More_50K | Row Total |
## -------------|-----------|-----------|-----------|
## Under_50K | 11808 | 2042 | 13850 |
## | 142.958 | 462.215 | |
## | 0.853 | 0.147 | 0.851 |
## | 0.950 | 0.531 | |
## | 0.725 | 0.125 | |
## -------------|-----------|-----------|-----------|
## More_50K | 627 | 1804 | 2431 |
## | 814.465 | 2633.353 | |
## | 0.258 | 0.742 | 0.149 |
## | 0.050 | 0.469 | |
## | 0.039 | 0.111 | |
## -------------|-----------|-----------|-----------|
## Column Total | 12435 | 3846 | 16281 |
## | 0.764 | 0.236 | |
## -------------|-----------|-----------|-----------|
##
##
summary(pred_rpart)
## Under_50K More_50K
## 13850 2431
kable(cm_rpart)
| overall | |
|---|---|
| Accuracy | 0.8360666 |
| Kappa | 0.4795717 |
| AccuracyLower | 0.8302892 |
| AccuracyUpper | 0.8417236 |
| AccuracyNull | 0.7637737 |
| AccuracyPValue | 0.0000000 |
| McnemarPValue | 0.0000000 |
accuracy <- sum(pred_rpart == test$IncomeCLASS)/length(test$IncomeCLASS)
print(accuracy)
## [1] 0.8360666
rpart.plot(model_part$finalModel)
set.seed(32323)
model.rf <- randomForest(IncomeCLASS ~ ., data = train, ntree = 500, importance = TRUE, mtry = 2, do.trace = 100)
## ntree OOB 1 2
## 100: 13.40% 5.93% 36.95%
## 200: 13.35% 5.86% 36.99%
## 300: 13.30% 5.81% 36.88%
## 400: 13.31% 5.87% 36.79%
## 500: 13.34% 5.94% 36.67%
pred.rf <- predict(model.rf, test)
summary(pred.rf)
## Under_50K More_50K
## 13107 3174
reprtree:::plot.getTree(model.rf, k = 3, depth = 4)
confusionMatrix(test$IncomeCLASS, pred.rf)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Under_50K More_50K
## Under_50K 11670 765
## More_50K 1437 2409
##
## Accuracy : 0.8648
## 95% CI : (0.8594, 0.87)
## No Information Rate : 0.805
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6011
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.8904
## Specificity : 0.7590
## Pos Pred Value : 0.9385
## Neg Pred Value : 0.6264
## Prevalence : 0.8050
## Detection Rate : 0.7168
## Detection Prevalence : 0.7638
## Balanced Accuracy : 0.8247
##
## 'Positive' Class : Under_50K
##
cm.rf <- data.frame(confusionMatrix(pred.rf, test$IncomeCLASS)[3])
kable(cm.rf)
| overall | |
|---|---|
| Accuracy | 0.8647503 |
| Kappa | 0.6011184 |
| AccuracyLower | 0.8594013 |
| AccuracyUpper | 0.8699687 |
| AccuracyNull | 0.7637737 |
| AccuracyPValue | 0.0000000 |
| McnemarPValue | 0.0000000 |
set.seed(32323)
keeps <- c("Education_Num",
"Age",
"Hours_Per_Week",
"Sex",
"Capital_Gain",
"IncomeCLASS")
train.reduced <- train[,which(names(train) %in% keeps)]
test.reduced <- test[,which(names(test) %in% keeps)]
start <- proc.time()[3]
model.nn <- train(IncomeCLASS ~ .,
data = train.reduced,
method = "nnet")
par(mar = c(1, 1, 1, 1))
plotnet(model.nn$finalModel, y_names = "IncomeCLASS")
title("Graphical Representation of our Neural Network")
predictions <- predict(model.nn, test.reduced[,1:5])
accuracy <- sum(predictions == test.reduced[,6])/length(test.reduced[,6])
print(accuracy)
cm.neural <- data.frame(confusionMatrix(predictions, test.reduced$IncomeCLASS)[3])
kable(cm.neural)
garson(model.nn$finalModel)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Under_50K More_50K
## Under_50K 11062 1373
## More_50K 1165 2681
##
## Accuracy : 0.8441
## 95% CI : (0.8384, 0.8497)
## No Information Rate : 0.751
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5759
## Mcnemar's Test P-Value : 3.976e-05
##
## Sensitivity : 0.9047
## Specificity : 0.6613
## Pos Pred Value : 0.8896
## Neg Pred Value : 0.6971
## Prevalence : 0.7510
## Detection Rate : 0.6794
## Detection Prevalence : 0.7638
## Balanced Accuracy : 0.7830
##
## 'Positive' Class : Under_50K
##
| overall | |
|---|---|
| Accuracy | 0.8441128 |
| Kappa | 0.5759170 |
| AccuracyLower | 0.8384486 |
| AccuracyUpper | 0.8496537 |
| AccuracyNull | 0.7637737 |
| AccuracyPValue | 0.0000000 |
| McnemarPValue | 0.0000398 |
| overall | |
|---|---|
| Accuracy | 0.8441128 |
| Kappa | 0.5759170 |
| AccuracyLower | 0.8384486 |
| AccuracyUpper | 0.8496537 |
| AccuracyNull | 0.7637737 |
| AccuracyPValue | 0.0000000 |
| McnemarPValue | 0.0000398 |
kable(cm_Nb)
| overall | |
|---|---|
| Accuracy | 0.8251950 |
| Kappa | 0.4619397 |
| AccuracyLower | 0.8192724 |
| AccuracyUpper | 0.8310012 |
| AccuracyNull | 0.7637737 |
| AccuracyPValue | 0.0000000 |
| McnemarPValue | 0.0000000 |
kable(cm_rpart)
| overall | |
|---|---|
| Accuracy | 0.8360666 |
| Kappa | 0.4795717 |
| AccuracyLower | 0.8302892 |
| AccuracyUpper | 0.8417236 |
| AccuracyNull | 0.7637737 |
| AccuracyPValue | 0.0000000 |
| McnemarPValue | 0.0000000 |
#### NEURAL MODEL >>> model_neural_AA <- train(IncomeCLASS ~ ., data = train, method = “nnet”, trControl =
kable(cm.rf)
| overall | |
|---|---|
| Accuracy | 0.8647503 |
| Kappa | 0.6011184 |
| AccuracyLower | 0.8594013 |
| AccuracyUpper | 0.8699687 |
| AccuracyNull | 0.7637737 |
| AccuracyPValue | 0.0000000 |
| McnemarPValue | 0.0000000 |
| overall| | |
| |:————–|———:| | |
| |Accuracy | 0.8643818| | |
| |Kappa | 0.6007945| | |
| |AccuracyLower | 0.8590268| | |
| |AccuracyUpper | 0.8696063| | |
| |AccuracyNull | 0.7637737| | |
| |AccuracyPValue | 0.0000000| | |
| |McnemarPValue | 0.0000000| |
In this binary context the random forest model performs best to classiy income levels. If the complexity or dimensionality were greater, I would neural neetwork would close the accurscy gap.
The analysis confirmed (and quantified) what is considered common sense:
Age, education, occupation, and marital status (or relationship kind) are good for predicting income (above a certain threshold).
About 46% of the people are in a relationship called “Husband” or “Wife” which is then further classified based on Education Level where nearly 14% who earn above $50 K have the education of Bachelors, Prof-school, Masters and Doctorate.
The other education levels have income predominantly below $50 k with just 2% having salaries above $50k who also have capital gains greater than $5096
With respect to other relationships, only 1% have income above $50 k and with capital gains greater than $7074.
In the relationship of Education and Number of People Earning > 50 k and separated by Work Class. We find that Bachelors graduates working in Private companies have a higher number of people earning above 50 k.
In the relationship of Average hours per week with respect to gender and separated by Work Class and we find that Males typically work more hours per week on Average across all work classes.
In the relationship of marital status and income levels separated by Work Class, the majority of the people in Married with Civilian spouse have an income greater than 50 k and majorly in the private sector.
A takeaway in looking at implact of occupation, capital gain and capital loss on the income; we find that Executives at Managerial Level have more people with income greater than 50k and Professional Speciality has more capital gains.